fix(reborn): remove per-record lock convoys via shared cas_update helper by henrypark133 · Pull Request #5234 · nearai/ironclaw

henrypark133 · 2026-06-25T11:53:19Z

Problem

Each Reborn persistence store wrapped its filesystem read-modify-write in a per-record tokio::sync::Mutex (FILESYSTEM_RECORD_LOCKS) held across .await — a redundant in-process serializer over backends that already do versioned CAS. Under burst, one writer stalled inside its critical section blocks every other writer for that scope (the convoy that contributed to the runtime wedge). PR #5142 removed exactly this from ironclaw_turns; this PR finishes the job for the remaining stores via one shared helper.

Change

Shared helper (ironclaw_filesystem::cas_update): one mutex-free, bounded read-modify-write loop — read versioned snapshot → idempotent apply closure → CAS put → on VersionMismatch re-read and retry (32×, jittered 2–50ms backoff, 15s timeout), with a fail-closed capability gate. Generic over record/outcome/error; leaks no store types.

Migrated all 5 stores onto it and deleted every per-record mutex:

ironclaw_turns — re-homed its local fix(turns): prevent turn-state write convoy #5142 copy onto the shared helper (one owner).
ironclaw_run_state, ironclaw_threads — dropped the mutex; route through the helper.
ironclaw_resources — gains a retry loop it never had (the lock was its only serializer).
ironclaw_secrets — deletes the Arc-keyed lock map → fixes an unbounded memory leak (it stored Arc, never pruned).

Post-condition: grep FILESYSTEM_RECORD_LOCKS / record_lock.lock().await / filesystem_secret_lock across the 5 store crates is empty.

Guardrail: .claude/rules/database.md records the invariant — filesystem read-modify-write must go through cas_update; never wrap it in a per-record mutex held across .await.

Tri-backend parity (no diverging experience by host)

The helper is RootFilesystem-level (backend-agnostic). Verified across all three ironclaw_filesystem backends:

in-memory (default): clippy 0 across the 6 crates; all store tests pass (turns/run_state/threads/resources/secrets).
postgres (--features postgres): CAS contract tests pass — postgres_native_put_cas_absent_rejects_existing_path, postgres_native_put_cas_any_increments_existing_version, postgres_transaction_rollback_discards_prior_put_after_later_cas_conflict.
libsql (--features libsql): CAS contract tests pass. (One unrelated append/dir contract test is flaky under parallel suite execution — passes 3/3 in isolation; db.rs is untouched by this PR, so it is pre-existing test-infra flakiness, not a regression.)

Notes

Concurrency-sensitive; the helper carries a high-contention storm test (N concurrent writers, no lost update), retries-exhausted, capability-gate-rejects-non-CAS, and create-if-absent.
Stacks conceptually on fix(turns): prevent turn-state write convoy #5142 (already merged). Companion to the runtime-wedge work (fix(reborn): stop WASM execution from starving the tokio worker pool #5206).

🤖 Generated with Claude Code

Extract the proven mutex-free CAS pattern from ironclaw_turns (PR #5142) into one shared helper in ironclaw_filesystem so the remaining stores can drop their per-record tokio::sync::Mutex convoys. cas_update runs a bounded read-modify-write loop: read versioned snapshot, run the caller's idempotent apply closure, CAS-put with the read version, and on VersionMismatch re-read and retry with jittered exponential backoff (2ms..50ms), capped at 32 retries and wrapped in a 15s timeout. A fail-closed capability gate rejects backends that cannot CAS rather than silently blind-overwriting. The helper is generic over the record type, the outcome, and the caller's error; it never leaks store-specific types. No store migrated yet — that follows in the next commits. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…te helper ironclaw_resources had NO CAS-retry loop — its per-record tokio::sync::Mutex (FILESYSTEM_RECORD_LOCKS) held across the backend get/put awaits was the ONLY serializer. Under burst, one writer stalled inside its critical section parked every other same-scope writer (the convoy that contributed to the 2026-06-24 runtime wedge). Naively deleting that mutex without a retry loop loses updates: a racing writer's single-attempt CAS detects the version mismatch and errors out (cross-process CAS contention) instead of retrying. Route update_snapshot through ironclaw_filesystem::cas_update (bounded CAS retry + jittered backoff + 15s timeout + fail-closed capability gate) and delete the per-record mutex, FilesystemRecordLock, the local PutError, and the fail-open put_with_cas (CasExpectation::Any blind-overwrite fallback). The helper fails closed instead; verified safe — these store aliases only ever resolve to CAS-capable db/in-memory backends in production. The public update API widens from FnOnce to FnMut because the helper re-runs the closure against a freshly read snapshot on every CAS retry; leaf closures that previously moved captured values now clone per invocation. Add PartialEq to BudgetGateSnapshot (helper needs S: Clone + PartialEq). Red->green regression: cas_snapshot::tests::concurrent_increments_have_no_lost_updates proves RED on lock-removed-no-retry (lost updates / spurious contention) and GREEN through the helper (every concurrent increment lands, no convoy). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ant per-record mutex run_state held BOTH a store-local CAS-retry loop AND a per-record tokio::sync::Mutex (FILESYSTEM_RECORD_LOCKS) across the backend get/put awaits — belt-and-suspenders where the mutex was a redundant in-process serializer over a backend that already does versioned CAS. Under burst the mutex convoyed same-scope writers behind one stalled writer (the 2026-06-24 wedge pattern). Route apply_update, update_status, start, and save_pending through ironclaw_filesystem::cas_update (one shared CAS implementation: bounded retry + jittered backoff + 15s timeout + fail-closed capability gate). Delete the store-local FILESYSTEM_CAS_RETRIES loop, the per-record lock + accessor + its two unit tests, the local PutError, and the fail-open put_with_cas (CasExpectation::Any blind-overwrite fallback). The scope-ownership check and the approval Pending guard move inside the re-runnable apply closure; the ApprovalStatus guard returns Err from apply. discard_pending keeps its read-then-delete, only the lock is dropped. The run_state contract tests previously ran against LocalFilesystem (byte-only, no versioned CAS) where the OLD fail-open fallback masked the missing CAS. Production never routes run_state to LocalFilesystem — only to CAS-capable db/in-memory backends — so the tests now use InMemoryBackend, matching the production capability shape (CAS) instead of a non-production blind-overwrite path. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…p per-record mutex ensure_thread held the only per-record tokio::sync::Mutex (FILESYSTEM_RECORD_LOCKS) in this crate, across the backend get/put awaits, to serialize the check-then-create against Unsupported(WriteFile)-fallback backends. Route it through ironclaw_filesystem::cas_update: the apply closure returns the existing record unchanged (no-op, no write) when the thread already exists with a matching scope, or builds the fresh StoredThreadRecord when absent. A concurrent create-if-absent loser hits VersionMismatch, the helper re-reads and re-runs apply which now sees the winner and reconciles scope — exactly the old single-reconcile semantics, now via the shared CAS-retry loop without any lock. Delete the lock infra (Weak map + accessor); add PartialEq to StoredThreadRecord (helper needs S: Clone + PartialEq). The three other record/txn loops were already lock-free and are left as-is. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…cord mutex (Arc leak fix) The secrets filesystem store kept FILESYSTEM_RECORD_LOCKS as a HashMap of *strong* Arc<Mutex<()>> (not Weak), one entry per (scope,lease)/session path ever seen, NEVER pruned — an unbounded memory leak. The per-record mutex was also held across the backend get/put awaits, convoying same-key writers behind one stalled writer (the 2026-06-24 wedge pattern). Route consume/revoke/consume_session_use through ironclaw_filesystem::cas_update (one shared CAS implementation), mapping the local CasDecision onto CasApply: Commit/BestEffortCommit -> changed snapshot; Settle(Ok) -> unchanged snapshot (helper skips the write via PartialEq no-op); Settle(Err)/not-found -> apply error. Crypto decrypt + use-count increment run inside the re-runnable apply closure (pure / recomputed from the freshly read record each retry). validate_session is a pure read — its lock is simply deleted. Delete the entire lock map (leak gone), cas_mutate/CasDecision/CAS_RETRY_ATTEMPTS, and the fail-open put_with_version_fallback (helper fails closed). Add PartialEq to StoredLease and StoredSession. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

PR #5142 removed the per-record mutex from ironclaw_turns but left a local copy of the CAS read-modify-write loop (apply_with_retry, put_with_cas, PutError, cas_retry_backoff + constants). Re-home it onto ironclaw_filesystem::cas_update so there is ONE CAS implementation across the codebase. Behavior is preserved: the 18 call sites and their closure shape (FnMut(InMemoryTurnStateStore) -> Fut) are unchanged; the exact timeout/exhaustion error strings are preserved. A BridgeError<T> sentinel carries the absent-record + default-snapshot no-op through the helper's apply-error channel (the helper's own no-op check only fires for Some(existing)==new; the turns store additionally must skip creating a file for an empty default store, which the old new==old check covered because read returned default on absent). The 500ms SNAPSHOT_READ_CACHE_TTL read cache is kept as a separate layer: cleared around the CAS call and repopulated on the next read (the helper does its own fresh get and does not surface the new RecordVersion; every read_snapshot caller already discards the version, so caching None is safe). Delete the now-unused local loop, put_with_cas, PutError, cas_retry_backoff, and the duplicate CAS constants. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…cord mutex Add the invariant to crates/ironclaw_filesystem/CLAUDE.md (#2 "CAS is the floor") and a pointer in .claude/rules/database.md: every filesystem read-modify-write must go through the one shared ironclaw_filesystem::cas_update helper; never wrap it in a per-record tokio::sync::Mutex held across the backend .await (redundant serializer + convoy/wedge risk + leak). Also commit Cargo.lock (async-trait dev-dep added to ironclaw_resources for the convoy regression test). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

gemini-code-assist

Code Review

This pull request migrates multiple persistence stores (ironclaw_resources, ironclaw_run_state, ironclaw_threads, ironclaw_secrets, and ironclaw_turns) to use a shared, lock-free compare-and-swap (CAS) helper (cas_update). This eliminates redundant per-record mutexes held across .await boundaries, preventing runtime-wedging convoys under high contention. Feedback on the changes highlights an opportunity to improve the jitter calculation in cas_retry_backoff using RandomState to avoid zero-jitter scenarios in low clock resolution environments (such as VMs or Windows). Additionally, it is recommended to explicitly add Send bounds to the closure generic constraints in cas_update and cas_update_loop to catch any non-Send regressions at the helper's definition site.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-25T11:56:05Z

+async fn cas_retry_backoff(attempt: usize) {
+    let shift = attempt.min(8) as u32;
+    let multiplier = 1_u32.checked_shl(shift).unwrap_or(u32::MAX);
+    let base_delay = FILESYSTEM_CAS_BACKOFF_BASE
+        .saturating_mul(multiplier)
+        .min(FILESYSTEM_CAS_BACKOFF_MAX);
+    let jitter = SystemTime::now()
+        .duration_since(UNIX_EPOCH)
+        .map(|elapsed| {
+            let jitter_ceiling = base_delay.as_millis().max(1);
+            Duration::from_millis((elapsed.as_nanos() % jitter_ceiling) as u64)
+        })
+        .unwrap_or_default();
+    tokio::time::sleep(base_delay.saturating_add(jitter)).await;
+}


Using SystemTime::now() for jitter calculation can lead to zero jitter on systems with low clock resolution (e.g., 1ms resolution in VMs, containers, or Windows). Since base_delay.as_millis() is at most 50, and 1,000,000 (1ms in nanoseconds) is divisible by all common backoff ceilings (2, 4, 8, 16, 32, 50), the modulo operation elapsed.as_nanos() % jitter_ceiling will always evaluate to exactly 0. This completely defeats the purpose of jitter under high contention.

Instead, use std::collections::hash_map::RandomState to obtain a high-quality, thread-safe, and un-correlated pseudo-random value seeded from the OS without adding external dependencies.

Suggested change

async fn cas_retry_backoff(attempt: usize) {

let shift = attempt.min(8) as u32;

let multiplier = 1_u32.checked_shl(shift).unwrap_or(u32::MAX);

let base_delay = FILESYSTEM_CAS_BACKOFF_BASE

.saturating_mul(multiplier)

.min(FILESYSTEM_CAS_BACKOFF_MAX);

let jitter = SystemTime::now()

.duration_since(UNIX_EPOCH)

.map(|elapsed| {

let jitter_ceiling = base_delay.as_millis().max(1);

Duration::from_millis((elapsed.as_nanos() % jitter_ceiling) as u64)

})

.unwrap_or_default();

tokio::time::sleep(base_delay.saturating_add(jitter)).await;

}

async fn cas_retry_backoff(attempt: usize) {

let shift = attempt.min(8) as u32;

let multiplier = 1_u32.checked_shl(shift).unwrap_or(u32::MAX);

let base_delay = FILESYSTEM_CAS_BACKOFF_BASE

.saturating_mul(multiplier)

.min(FILESYSTEM_CAS_BACKOFF_MAX);

let jitter = {

use std::collections::hash_map::RandomState;

use std::hash::{BuildHasher, Hash, Hasher};

let mut hasher = RandomState::new().build_hasher();

attempt.hash(&mut hasher);

let hash = hasher.finish();

let jitter_ceiling = base_delay.as_millis().max(1) as u64;

Duration::from_millis(hash % jitter_ceiling)

};

tokio::time::sleep(base_delay.saturating_add(jitter)).await;

}

Fixed in 0eaf69d — jitter now derives from a RandomState-seeded hash (clock-independent), so it no longer collapses to 0 on coarse-resolution clocks. See cas_retry_backoff in crates/ironclaw_filesystem/src/cas.rs.

gemini-code-assist · 2026-06-25T11:56:05Z

+pub async fn cas_update<F, S, T, E, D, N, A, Fut>(
+    filesystem: &ScopedFilesystem<F>,
+    scope: &ResourceScope,
+    path: &ScopedPath,
+    decode: D,
+    encode: N,
+    mut apply: A,
+) -> Result<T, CasUpdateError<E>>
+where
+    F: RootFilesystem + ?Sized,
+    S: PartialEq + Clone,
+    D: Fn(&[u8]) -> Result<S, E>,
+    N: Fn(&S) -> Result<Entry, E>,
+    A: FnMut(Option<S>) -> Fut,
+    Fut: Future<Output = Result<CasApply<S, T>, E>>,


According to the repository's general rules, async helper functions that return futures intended to be Send should explicitly add Send bounds to their closure generic constraints. This ensures that any non-Send regressions are caught at the helper's definition site rather than at the call site.

pub async fn cas_update<F, S, T, E, D, N, A, Fut>( filesystem: &ScopedFilesystem<F>, scope: &ResourceScope, path: &ScopedPath, decode: D, encode: N, mut apply: A, ) -> Result<T, CasUpdateError<E>> where F: RootFilesystem + ?Sized, S: PartialEq + Clone + Send, T: Send, E: Send, D: Fn(&[u8]) -> Result<S, E> + Send, N: Fn(&S) -> Result<Entry, E> + Send, A: FnMut(Option<S>) -> Fut + Send, Fut: Future<Output = Result<CasApply<S, T>, E>> + Send,

References

In Rust, explicitly add Send bounds to closure generic constraints in async helper functions that return futures intended to be Send. This ensures that non-Send regressions are caught at the helper's definition site rather than at the call site, even if the future is already bound by Send at the call site.

Considered but declined. Adding Send to the helper generics forces propagating Send across consumer-crate generics (run_state mutate: M, turns persistence T) — threading a bound through multiple layers for marginal "catch at the def site" benefit, which .claude/rules/architecture.md flags as bound-creep. The futures are already Send-constrained where awaited/spawned.

gemini-code-assist · 2026-06-25T11:56:05Z

+async fn cas_update_loop<F, S, T, E, D, N, A, Fut>(
+    filesystem: &ScopedFilesystem<F>,
+    scope: &ResourceScope,
+    path: &ScopedPath,
+    decode: &D,
+    encode: &N,
+    apply: &mut A,
+) -> Result<T, CasUpdateError<E>>
+where
+    F: RootFilesystem + ?Sized,
+    S: PartialEq + Clone,
+    D: Fn(&[u8]) -> Result<S, E>,
+    N: Fn(&S) -> Result<Entry, E>,
+    A: FnMut(Option<S>) -> Fut,
+    Fut: Future<Output = Result<CasApply<S, T>, E>>,


According to the repository's general rules, async helper functions that return futures intended to be Send should explicitly add Send bounds to their closure generic constraints. This ensures that any non-Send regressions are caught at the helper's definition site rather than at the call site.

async fn cas_update_loop<F, S, T, E, D, N, A, Fut>( filesystem: &ScopedFilesystem<F>, scope: &ResourceScope, path: &ScopedPath, decode: &D, encode: &N, apply: &mut A, ) -> Result<T, CasUpdateError<E>> where F: RootFilesystem + ?Sized, S: PartialEq + Clone + Send, T: Send, E: Send, D: Fn(&[u8]) -> Result<S, E> + Send, N: Fn(&S) -> Result<Entry, E> + Send, A: FnMut(Option<S>) -> Fut + Send, Fut: Future<Output = Result<CasApply<S, T>, E>> + Send,

References

In Rust, explicitly add Send bounds to closure generic constraints in async helper functions that return futures intended to be Send. This ensures that non-Send regressions are caught at the helper's definition site rather than at the call site, even if the future is already bound by Send at the call site.

Same decision as the cas_update bound above — declined to avoid propagating Send across consumer-crate generics for marginal benefit.

coderabbitai · 2026-06-25T11:58:10Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Filesystem-backed read-modify-write paths now use cas_update with fail-closed gating, bounded retries, and timeout handling. Run-state, resources, secrets, threads, and turns were migrated off local lock/CAS loops, and discarded approval-state handling was updated.

Changes

Shared filesystem CAS migration

Layer / File(s)	Summary
Contract and public API surface `.claude/rules/database.md`, `crates/ironclaw_filesystem/CLAUDE.md`, `docs/plans/2026-06-25-cas-migration.md`, `crates/ironclaw_filesystem/src/scoped.rs`, `crates/ironclaw_filesystem/src/lib.rs`, `crates/ironclaw_filesystem/Cargo.toml`, `crates/ironclaw_resources/Cargo.toml`, `crates/ironclaw_run_state/tests/approval_resolution_contract.rs`, `crates/ironclaw_capabilities/src/helpers.rs`	Filesystem invariants and migration notes now require `cas_update`; `ScopedFilesystem` exposes capabilities; the crate root exports the CAS helper, types, and timing constants; and discarded approval-state mapping is added at the capability layer.
cas_update loop and tests `crates/ironclaw_filesystem/src/cas.rs`, `crates/ironclaw_filesystem/src/cas/tests.rs`	The shared helper performs capability-gated retries, timeout, backoff, and mapped error handling, with behavior covered by tests.
CasSnapshotStore and resource retries `crates/ironclaw_resources/src/cas_snapshot.rs`, `crates/ironclaw_resources/src/filesystem_store.rs`, `crates/ironclaw_resources/src/lib.rs`	`CasSnapshotStore`, resource-governor updates, and budget-gate updates switch to shared CAS retries and re-runnable closures, and the affected snapshot types derive `PartialEq`.
Run-state record updates `crates/ironclaw_run_state/src/lib.rs`, `crates/ironclaw_run_state/tests/*`, `crates/ironclaw_product_workflow/src/approval_interaction/service.rs`	`FilesystemRunStateStore` and approval-request storage move creation, mutation, discard tombstones, and record filtering onto `cas_update` while removing the local lock map and legacy CAS write helpers; discarded-state handling and tests align to the new status.
Secret, broker, and thread CAS updates `crates/ironclaw_secrets/src/filesystem_store.rs`, `crates/ironclaw_threads/src/filesystem_service.rs`, `crates/ironclaw_host_runtime/tests/reborn_durable_restart_integration.rs`	Secret, credential-broker, and thread persistence paths now use shared CAS retries, with updated derives, record kinds, error mapping, and in-memory backend wiring for durable test coverage.
Turn-state CAS update and cache handling `crates/ironclaw_turns/src/filesystem_store.rs`, `crates/ironclaw_turns/src/filesystem_store/tests.rs`, `crates/ironclaw_turns/tests/filesystem_turn_state_contract.rs`, `tests/support/reborn/*`	Turn-state writes route through `cas_update` with cache clearing, overlay preservation, and no-op handling, and the regression tests and harness wiring cover the resulting cache/version behavior.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related issues

Migrate ironclaw_turns runner-lease sidecar CAS loops onto shared cas_update #5274: Same ironclaw_turns CAS migration, including FilesystemTurnStateStore::apply and runner-lease sidecar behavior.

Possibly related PRs

nearai/ironclaw#5002: Direct overlap in crates/ironclaw_threads/src/filesystem_service.rs, including thread record persistence and update flow changes.
nearai/ironclaw#5142: Same turn-state snapshot write path migration to bounded CAS-style retry behavior in crates/ironclaw_turns/src/filesystem_store.rs.
nearai/ironclaw#5232: Adjacent ironclaw_turns runner-lease sidecar and CAS/heartbeat overlay behavior.

Suggested reviewers

think-in-universe

Poem

A rabbit hopped through retry light,
Where CAS kept each old write in sight.
Discarded gates went soft and still,
While snapshots danced with bounded will.
🐇✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Description check	⚠️ Warning	The description covers the change, but it omits most required template sections like change type, linked issue, validation, rollback, and review track.	Reformat it to the repo template and add the missing sections: Summary, Change Type, Linked Issue, Validation, Security Impact, Trust-Boundary Checklist, Database Impact, Blast Radius, Rollback Plan, Review Follow-Through, and Review track.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly summarizes the main change: replacing per-record lock convoys with the shared cas_update helper.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands.}

chatgpt-codex-connector

💡 Codex Review

ironclaw/crates/ironclaw_run_state/src/lib.rs

Line 932 in 3223540

self.filesystem.delete(scope, &path).await?;

Guard discard deletes with CAS

When discard_pending races in the same process with approve or deny for the same request, this unconditional delete can now run after the resolver's CAS-protected status update and remove the terminal approval record. The removed per-record lock used to serialize that get/check/delete sequence with status updates; without either that lock or a CAS/tombstone transition here, a user approval can be lost and later look like UnknownApprovalRequest rather than Approved.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

coderabbitai

Actionable comments posted: 9

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

crates/ironclaw_run_state/src/lib.rs (1)
922-932: 🗄️ Data Integrity & Integration | 🟠 Major | 🏗️ Heavy lift

Make discard_pending atomic before dropping the lock.

This path reads Pending, then deletes later with no CAS expectation. A concurrent approve()/deny() can win the CAS update between Line 922 and Line 932, then this delete removes a non-pending approval record. Use a CAS-aware delete/transactional transition, or model discard as a CAS status update instead of read-then-delete. As per path instructions: “Fail loud” and filesystem read-modify-write paths must use the CAS invariant rather than split multi-step state changes.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/ironclaw_run_state/src/lib.rs` around lines 922 - 932, The
discard_pending flow in RunState::discard_pending is currently split into a
read/validate step and a later filesystem delete, which can race with
approve()/deny() and remove a record after its status has already changed.
Update this path to be atomic by using a CAS-aware transition or transactional
delete that enforces the same status invariant at the point of mutation, rather
than relying on a prior Pending check; keep the failure path loud if the
expected state no longer matches.
Sources: Coding guidelines, Path instructions

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/ironclaw_filesystem/src/cas.rs`:
- Line 193: The new Clippy allow on cas.rs needs the required arch-exempt
rationale comment immediately above #[allow(clippy::too_many_arguments)]. Update
the same site near the cas:: function signature that triggered the lint by
adding the mandatory // arch-exempt: too_many_args, <reason naming the missing
aggregation>, plan `#NNNN` annotation, and make sure the reason justifies why this
many parameters is unavoidable until the missing aggregation is introduced.
- Around line 295-306: The CAS preflight is treating
BackendCapabilities::default() as “unknown,” which lets a backend with no
advertised capabilities bypass the fail-closed gate. Update capabilities_known()
in cas.rs to only treat an explicit unknown state as unknown and not overload
BackendCapabilities::empty()/default(), then make the cas_update path rely on
that check so unsupported backends still return CasUpdateError::CasUnsupported
instead of falling back to op-time behavior.
- Around line 262-265: The no-op shortcut in cas_update_loop currently only
returns early when current is Some(existing) and matches the snapshot, while
absent records still fall through to encoding and writing; update the logic to
either treat None plus an unchanged/default snapshot as a true no-op or adjust
the CasApply/cas_update contract and surrounding comments to state the shortcut
only applies to existing records. Use the existing cas_update_loop and CasApply
symbols to keep the behavior and docs aligned, and remove any need for
downstream NoOp suppression if you choose to enforce it here.

In `@crates/ironclaw_filesystem/src/cas/tests.rs`:
- Around line 184-399: Add a regression test in cas/tests.rs for the timeout
path in cas_update, since the current tests cover retries, CAS rejection, and
apply errors but never exercise CasUpdateError::Timeout. Introduce a tiny wedged
backend or stub that blocks one of the CAS operations (get, put, or apply), and
use paused Tokio time so the outer timeout in cas_update deterministically
fires. Reference cas_update and CasUpdateError::Timeout in the new test, and
assert that the timeout branch wins over the hung backend behavior.

In `@crates/ironclaw_resources/src/cas_snapshot.rs`:
- Line 212: The CasUpdateError::Backend handling in cas_snapshot.rs currently
forwards the backend error string directly via E::storage_from(inner), which can
leak virtual paths and raw backend details into StorageError. Update this
mapping to return a stable sanitized public message from the error conversion
path, and preserve the detailed backend error only in internal diagnostics or
source context associated with E::storage_from.

In `@crates/ironclaw_run_state/src/lib.rs`:
- Line 1109: The `CasUpdateError::Backend` branch in `RunStateError` currently
converts the filesystem failure to a string, which drops the structured
`FilesystemError` context. Update this match arm in `lib.rs` to preserve the
typed error by using the existing `?`-style conversion path or direct typed
`From`/`Into` conversion used elsewhere in this file, so the `FilesystemError`
variant and its path/operation details continue to propagate through
`RunStateError::Filesystem`.

In `@crates/ironclaw_secrets/src/filesystem_store.rs`:
- Around line 479-483: The no-op CAS comment in filesystem_store.rs is
inaccurate: the `already_marked` branch in the lease update path returns
`Err(SecretStoreError::LeaseExpired { lease_id })` and goes through
`CasUpdateError::Apply`, not the unchanged record/`PartialEq` skip path. Update
the comment to describe the actual intent, or change the branch in the relevant
lease/CAS helper to return `CasApply::new(lease, Err(...))` if that is the
desired behavior; use the `already_marked`, `SecretLeaseStatus::Expired`, and
`CasUpdateError::Apply` symbols to locate and align the code with the documented
contract.

In `@crates/ironclaw_turns/src/filesystem_store.rs`:
- Around line 378-385: The CAS error mapper in map_cas_error should not panic on
BridgeError::NoOp, since this is production error-handling code. Replace the
unreachable! fallback in map_cas_error with a typed unavailable TurnError that
preserves context about the unexpected NoOp leak, while keeping the
BridgeError::Real(inner) path unchanged and consistent with CasUpdateError
handling.

In `@docs/plans/2026-06-25-cas-migration.md`:
- Around line 137-140: The quality gate in the migration plan is outdated and
should match the repo’s required backend validation for persistence changes.
Update the “Quality gate” section to include the feature-isolation `cargo check`
matrix for the default/postgres build, `--no-default-features --features
libsql`, and `--all-features`, and strengthen the clippy step to the full `cargo
clippy --all --benches --tests --examples --all-features -- -D warnings`. Keep
the existing formatting/tests coverage, but ensure the plan explicitly reflects
the required checks for the legacy persistence migration.

---

Outside diff comments:
In `@crates/ironclaw_run_state/src/lib.rs`:
- Around line 922-932: The discard_pending flow in RunState::discard_pending is
currently split into a read/validate step and a later filesystem delete, which
can race with approve()/deny() and remove a record after its status has already
changed. Update this path to be atomic by using a CAS-aware transition or
transactional delete that enforces the same status invariant at the point of
mutation, rather than relying on a prior Pending check; keep the failure path
loud if the expected state no longer matches.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 1c594037-4929-451d-97ad-73e096ed325d

📥 Commits

Reviewing files that changed from the base of the PR and between 44f063d and 3223540.

⛔ Files ignored due to path filters (1)

Cargo.lock is excluded by !**/*.lock, !**/Cargo.lock

📒 Files selected for processing (17)

.claude/rules/database.md
crates/ironclaw_filesystem/CLAUDE.md
crates/ironclaw_filesystem/src/cas.rs
crates/ironclaw_filesystem/src/cas/tests.rs
crates/ironclaw_filesystem/src/lib.rs
crates/ironclaw_filesystem/src/scoped.rs
crates/ironclaw_resources/Cargo.toml
crates/ironclaw_resources/src/cas_snapshot.rs
crates/ironclaw_resources/src/filesystem_store.rs
crates/ironclaw_resources/src/lib.rs
crates/ironclaw_run_state/src/lib.rs
crates/ironclaw_run_state/tests/approval_resolution_contract.rs
crates/ironclaw_run_state/tests/run_state_contract.rs
crates/ironclaw_secrets/src/filesystem_store.rs
crates/ironclaw_threads/src/filesystem_service.rs
crates/ironclaw_turns/src/filesystem_store.rs
docs/plans/2026-06-25-cas-migration.md

railway-app · 2026-06-25T12:13:10Z

🚅 Deployed to the ironclaw-pr-5234 environment in ironclaw-ci-preview

Service	Status	Web	Updated (UTC)
ironclaw	✅ Success (View Logs)	Web	Jun 26, 2026 at 4:59 pm

Fold in PR review (henrypark133) + a durable-write hot-path audit. Adds §12 framing the deployed lease-expiry cascade as three independent layers: - Layer A: in-process lock convoy → #5234 (open); write-behind composes on the post-#5234 CAS path. - Layer B: pool starvation — DEFAULT_POSTGRES_POOL_MAX_SIZE=2 shared across all Postgres FS I/O. Notes the existing 30s checkout guard (closes the infinite hang) but flags pool-too-small + checkout(30s)>apply(15s); cheap mitigations (raise pool, reserve a critical connection, align checkout<apply) prior to and complementary with write-behind. - Layer C: the synchronous hot-path write map (events / governor / thread-append / lease / memory) with the write-behind-vs-batch-coalesce distinction. Events are the top batch-coalesce target (highest churn, O(1) INSERT no CAS); must stay DURABLE (source of truth), per-step flush to preserve live SSE. Memory stays synchronous (FTS-only, no embedding write; agent-initiated, low churn). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Resolve PR #5234 review comments across the shared cas_update helper and its consumers: - cas: replace SystemTime-based jitter (collapses to 0 on coarse clocks) with a RandomState-seeded value; drop the now-unnecessary too_many_arguments allow (helper has 6 args, below clippy threshold). - cas: add explicit CasApply::no_op so absent-record no-ops are signalled directly; make the contract doc true. Deletes the BridgeError::NoOp "success-through-an-error-channel" hack in ironclaw_turns (E is now TurnError directly) and simplifies map_cas_error. - run_state: fix a TOCTOU race in FilesystemApprovalRequestStore:: discard_pending (Codex P1). The non-atomic get->check->delete could remove a record an approve()/deny() CAS had just transitioned. Route discard through cas_update as a version-checked tombstone transition (new ApprovalStatus::Discarded); get/records_for_scope filter it so it still reads as gone. Adds a no-clobber regression test. - turns/secrets: correct inaccurate CAS no-op comment; replace an unreachable!() panic in map_cas_error with a typed Unavailable error. - add a deterministic CasUpdateError::Timeout regression test (paused Tokio time + hanging backend). - docs: restore the feature-isolation cargo check matrix and full clippy gate in the migration plan. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

crates/ironclaw_product_workflow/src/approval_interaction/service.rs (1)
158-164: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Classify discarded approvals as stale before the persistent branch.

For persistent == true, ApprovalStatus::Discarded hits Line 163 and returns AlwaysAllowUnsupported, so the new stale arm at Line 199 is unreachable for that path.
Proposed fix
-        if matches!(status, ApprovalStatus::Denied | ApprovalStatus::Expired) {
+        if matches!(
+            status,
+            ApprovalStatus::Denied | ApprovalStatus::Expired | ApprovalStatus::Discarded
+        ) {
As per coding guidelines, WebUI-facing facade errors must expose the stable sanitized taxonomy for blocked approval/auth/resource and stale/conflict states.

Also applies to: 199-199
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/ironclaw_product_workflow/src/approval_interaction/service.rs` around
lines 158 - 164, The approval gating in
approval_interaction::service::ApprovalInteractionService should classify
ApprovalStatus::Discarded as stale before the persistent-only unsupported
branch. Update the early status checks around the approval_rejected flow so the
stale rejection path is taken for Discarded regardless of persistent, and ensure
the later stale arm remains reachable with the sanitized StaleGate taxonomy
instead of falling through to AlwaysAllowUnsupported.
Source: Coding guidelines

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/ironclaw_filesystem/src/cas.rs`:
- Around line 102-114: The no-op early returns in cas_update can return a
snapshot that is no longer current, which then gets cached by
filesystem_store::apply_from_snapshot and similar callers. Update cas_update so
the equality and explicit no_op branches either revalidate the record/version
before returning or avoid exposing a cacheable “final” snapshot from those
paths; use the cas_update, CasApply::new, and CasApply::no_op branches as the
main touchpoints. Also adjust the surrounding contract/comments so they only
promise what is actually enforced after concurrent writes.

In `@crates/ironclaw_run_state/tests/run_state_contract.rs`:
- Around line 491-510: The regression test in
filesystem_discard_does_not_clobber_resolved_approval is still sequential and
does not force the TOCTOU interleaving it describes. Update this test to use a
store/filesystem harness that can pause discard_pending after it reads a Pending
record, then let approve run on the same request before discard_pending resumes
its write/CAS path. Use FilesystemApprovalRequestStore and
discard_pending/approve to verify the resolved approval is preserved and the
terminal record is not clobbered.

In `@docs/plans/2026-06-25-cas-migration.md`:
- Around line 141-145: The quality gate currently uses cargo fmt --all, which
reformats files instead of validating CI-style formatting; update the checklist
entry in the plan to use the repo’s check-mode formatting command instead. Make
this change in the validation list alongside the other cargo check/clippy steps
so the migration plan matches the “Before You Open a PR” requirement and remains
authoritative.

---

Outside diff comments:
In `@crates/ironclaw_product_workflow/src/approval_interaction/service.rs`:
- Around line 158-164: The approval gating in
approval_interaction::service::ApprovalInteractionService should classify
ApprovalStatus::Discarded as stale before the persistent-only unsupported
branch. Update the early status checks around the approval_rejected flow so the
stale rejection path is taken for Discarded regardless of persistent, and ensure
the later stale arm remains reachable with the sanitized StaleGate taxonomy
instead of falling through to AlwaysAllowUnsupported.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: e8a5fca8-3068-48d7-92df-53d101912fa5

📥 Commits

Reviewing files that changed from the base of the PR and between 3223540 and 0eaf69d.

📒 Files selected for processing (10)

crates/ironclaw_capabilities/src/helpers.rs
crates/ironclaw_filesystem/Cargo.toml
crates/ironclaw_filesystem/src/cas.rs
crates/ironclaw_filesystem/src/cas/tests.rs
crates/ironclaw_product_workflow/src/approval_interaction/service.rs
crates/ironclaw_run_state/src/lib.rs
crates/ironclaw_run_state/tests/run_state_contract.rs
crates/ironclaw_secrets/src/filesystem_store.rs
crates/ironclaw_turns/src/filesystem_store.rs
docs/plans/2026-06-25-cas-migration.md

Reconcile the CAS-migration work with main, notably PR #5232 (durable runner-lease sidecar) which independently rewrote ironclaw_turns/filesystem_store.rs — splitting it into io/profile_resolver/ projection/runner_lease submodules and adding a per-run lease sidecar. Resolution of crates/ironclaw_turns/src/filesystem_store.rs: - Keep main's module split + runner-lease sidecar feature. - Keep this PR's routing of the turn-state snapshot RMW through the shared ironclaw_filesystem::cas_update helper (no inline CAS loop, no BridgeError; E is TurnError directly; CasApply::no_op for the inert-snapshot no-op). - The cas_update apply callback applies the runner-lease overlay per retry and uses the OVERLAID snapshot as the no-op baseline, matching main's pre-merge `new_snapshot == old_snapshot` semantics so an active lease overlay does not force a spurious write each apply. Covered by a new regression test (filesystem_turn_state_store_no_op_under_active_lease_overlay_does_not_rewrite_snapshot). - service.rs: add ApprovalStatus::Discarded to the early stale-gate guard for consistency with the exhaustive match arms. Known scoped exception, tracked in #5274: the runner-lease sidecar still uses a local put_with_cas + cas_retry_backoff loop rather than cas_update. Documented in crates/ironclaw_filesystem/CLAUDE.md; the main snapshot RMW does go through cas_update. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

henrypark133 · 2026-06-25T23:32:18Z

Addressed the Codex P1 (and CodeRabbit's matching outside-diff note) on discard_pending racing approve()/deny().

Fixed in 0eaf69d: the non-atomic get → check Pending → delete is replaced with a version-checked CAS transition through cas_update to a new ApprovalStatus::Discarded tombstone (mirrors update_status). If an approve()/deny() wins the CAS first, the discard's put hits VersionMismatch, re-reads, sees a non-Pending status, and returns ApprovalNotPending — so a resolved approval is never clobbered. get and records_for_scope filter Discarded, so a discarded request still reads as gone (contract unchanged). The in-memory store was left as-is — its get→check→remove is already atomic under one mutex.

Regression test added: filesystem_discard_does_not_clobber_resolved_approval (approve, then assert discard_pending returns ApprovalNotPending and the record stays Approved).

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

crates/ironclaw_product_workflow/src/approval_interaction/service.rs (2)
312-332: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Use debug! for these fallback diagnostics.

These new warn! calls are internal diagnostics on a best-effort path. In this repo, warn!/info! from runtime flows are reserved out because they corrupt the interactive REPL/TUI display. As per path instructions, REPL/TUI logging: info!/warn! corrupt the terminal UI — internal diagnostics use debug!.
Suggested diff
-            tracing::warn!(
+            tracing::debug!(
                 error = %error,
                 capability_id = %policy.key.capability_id,
                 action = ?policy.key.action,
                 "persistent approval policy write failed after approval resolution"
             );
@@
-            tracing::warn!(
+            tracing::debug!(
                 error = %error,
                 capability_id = %override_key.capability_id,
                 "tool permission override clear failed after persistent approval"
             );
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/ironclaw_product_workflow/src/approval_interaction/service.rs` around
lines 312 - 332, These fallback diagnostics in approval_interaction::service,
specifically the tracing calls inside the persistent_policies.allow and
tool_permission_overrides.clear error paths, should use debug! instead of warn!
because runtime warn/info logging can disrupt the REPL/TUI. Update both logging
sites to debug! while keeping the same error and capability context so the
internal best-effort failure reporting remains available without corrupting
terminal UI.
Source: Path instructions

306-333: 🗄️ Data Integrity & Integration | 🟠 Major | 🏗️ Heavy lift

Route persistent-approval side effects through the approval layer.

This service now calls PersistentApprovalPolicyStore::allow() and ToolPermissionOverrideStore::clear() directly from ironclaw_product_workflow. That bypasses the approval boundary and leaves the durable allow + override-clear sequence owned by product code instead of the approval layer that already owns resolution semantics. Move this behind ApprovalResolutionPort (or a dedicated approval-layer port) and invoke that here. As per path instructions, Approve/deny decisions must go through ApprovalResolutionPort and TurnCoordinator; product/WebUI code must not directly execute tools or mutate approval stores ad hoc.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/ironclaw_product_workflow/src/approval_interaction/service.rs` around
lines 306 - 333, Route the durable approval side effects out of
approval_interaction::service::persist_allow_policy and behind the approval
boundary instead of calling PersistentApprovalPolicyStore::allow() and
ToolPermissionOverrideStore::clear() directly. Introduce or use an
ApprovalResolutionPort-owned method that performs the allow-then-clear sequence,
and have persist_allow_policy delegate to that port so the approval layer
remains the single owner of resolution semantics and store mutation.
Source: Path instructions

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@crates/ironclaw_product_workflow/src/approval_interaction/service.rs`:
- Around line 312-332: These fallback diagnostics in
approval_interaction::service, specifically the tracing calls inside the
persistent_policies.allow and tool_permission_overrides.clear error paths,
should use debug! instead of warn! because runtime warn/info logging can disrupt
the REPL/TUI. Update both logging sites to debug! while keeping the same error
and capability context so the internal best-effort failure reporting remains
available without corrupting terminal UI.
- Around line 306-333: Route the durable approval side effects out of
approval_interaction::service::persist_allow_policy and behind the approval
boundary instead of calling PersistentApprovalPolicyStore::allow() and
ToolPermissionOverrideStore::clear() directly. Introduce or use an
ApprovalResolutionPort-owned method that performs the allow-then-clear sequence,
and have persist_allow_policy delegate to that port so the approval layer
remains the single owner of resolution semantics and store mutation.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: a098d8a1-20cb-419f-920f-fe4f83865273

📥 Commits

Reviewing files that changed from the base of the PR and between 0eaf69d and 2097c6d.

⛔ Files ignored due to path filters (1)

Cargo.lock is excluded by !**/*.lock, !**/Cargo.lock

📒 Files selected for processing (5)

crates/ironclaw_filesystem/CLAUDE.md
crates/ironclaw_product_workflow/src/approval_interaction/service.rs
crates/ironclaw_turns/src/filesystem_store.rs
crates/ironclaw_turns/src/filesystem_store/tests.rs
crates/ironclaw_turns/tests/filesystem_turn_state_contract.rs

💤 Files with no reviewable changes (3)

crates/ironclaw_turns/src/filesystem_store/tests.rs
crates/ironclaw_turns/tests/filesystem_turn_state_contract.rs
crates/ironclaw_turns/src/filesystem_store.rs

henrypark133

Code Review (multi-agent)

Intent: Replace per-record filesystem mutexes with a shared CAS update helper across Reborn stores to eliminate lock convoys and related leaks.

Stats: 6 findings kept (from 8 raw, 6 after filtering/dedup) across 3 files. Reviewers run: security, bugs, performance, tests, conventions, local-patterns, maintainability, approach. Reviewers failed: none. Body-only: 1. Filtered: 2 low-signal local-pattern doc notes.

All surviving findings are Medium severity, so I am leaving this as a comment review rather than requesting changes.

Security

Medium Discarded approvals now keep the full request payload on disk (crates/ironclaw_run_state/src/lib.rs:947-956, confidence 88) — anchor: crates/ironclaw_run_state/src/lib.rs:947
discard_pending now rewrites the filesystem record as Discarded instead of deleting it, but it keeps the full original ApprovalRequest body in the tombstone. That means a cancelled approval's action, reason, requester, and fingerprint remain recoverable from disk/backups even though public APIs hide discarded records. This is a data-minimization regression on an approval control-plane record.
Fix: Store only the minimal tombstone data needed to reserve the request ID, or keep the discarded-ID reservation in a compact side index instead of retaining the full approval payload.

Tests

Medium Missing test for op-time unsupported-write fail-closed path (crates/ironclaw_filesystem/src/cas.rs:337-340, confidence 95) — anchor: crates/ironclaw_filesystem/src/cas.rs:337
The helper covers the pre-flight non-CAS rejection, but the second fail-closed path is untested: an unknown/default-capability backend can read successfully and then return FilesystemError::Unsupported { operation: WriteFile } from put. That branch maps to CasUpdateError::CasUnsupported and should be protected because it is the composite-router fallback path.
Fix: Add tests::cas::unsupported_write_file_maps_to_cas_unsupported using a default-capabilities backend whose read succeeds and whose CAS write returns Unsupported(WriteFile).

Performance / Concurrency

Medium CAS helper clones the full snapshot on every retry (crates/ironclaw_filesystem/src/cas.rs:299-307, confidence 86) — anchor: crates/ironclaw_filesystem/src/cas.rs:306
cas_update clones the decoded snapshot before every apply, and several migrated call sites clone again when building CasApply::new(...). For large snapshots such as turn state, each write now copies the whole record at least twice, and contention multiplies that cost across the retry loop. That is avoidable churn on the hot persistence path this PR is trying to improve.
Fix: Pass apply a borrowed snapshot or split the helper API so the helper retains one owned copy for equality/write checks while call sites clone only when they truly need ownership.

Conventions

Medium Filesystem discard now blocks ID reuse, but the in-memory store does not (crates/ironclaw_run_state/src/lib.rs:947-956, confidence 82) — anchor: crates/ironclaw_run_state/tests/run_state_contract.rs:459 (no diff position — body only)
The filesystem store now writes a Discarded tombstone, so a later save_pending with the same request ID is rejected even though get and records_for_scope hide the record. The in-memory store still removes the record outright, so tests and early wiring can accept a reused approval ID that production rejects. That diverges the store contract introduced by this PR.
Fix: Make InMemoryApprovalRequestStore::discard_pending preserve a discarded tombstone, or separately track discarded IDs so save_pending rejects reused IDs in both backends.

Maintainability

Medium Turn-state apply has become a 1k-line orchestration knot (crates/ironclaw_turns/src/filesystem_store.rs:381-533, confidence 95) — anchor: crates/ironclaw_turns/src/filesystem_store.rs:381
This refactor pushes FilesystemTurnStateStore from 961 to 1051 lines, and the new apply flow now mixes cache invalidation, runner-lease overlay, transient store construction, FnMut adaptation, CAS no-op detection, timeout policy, and cache update/rollback in one method. That is a lot of state to hold in one control flow, especially in the turn-state persistence path.
Fix: Split the CAS-specific orchestration into smaller helpers, or move runner-lease overlay/cache bookkeeping behind a focused helper so apply coordinates the transition at a higher level.
Medium The apply timeout is now defined in two places (crates/ironclaw_turns/src/filesystem_store.rs:77-77, confidence 86) — anchor: crates/ironclaw_turns/src/filesystem_store.rs:77
The turn store keeps its own 15-second FILESYSTEM_APPLY_TIMEOUT, while the shared CAS helper now owns the same 15-second loop timeout. The outer timeout comment says this may be shorter for tests, but the default value is still duplicated, so a future timeout-policy change can drift the turn store away from the helper contract.
Fix: Use the shared ironclaw_filesystem::FILESYSTEM_APPLY_TIMEOUT as the turn-store default, or remove the outer default when the helper deadline is the only intended policy.

henrypark133 · 2026-06-26T03:26:02Z

+                            status: record.status,
+                        });
+                    }
+                    // Write a Discarded tombstone so the file still exists (preventing


Medium — Discarded approvals now keep the full request payload on disk.

discard_pending now rewrites the filesystem record as Discarded instead of deleting it, but it keeps the full original ApprovalRequest body in the tombstone. That means a cancelled approval's action, reason, requester, and fingerprint remain recoverable from disk/backups even though public APIs hide discarded records. This is a data-minimization regression on an approval control-plane record.

Fix: Store only the minimal tombstone data needed to reserve the request ID, or keep the discarded-ID reservation in a compact side index instead of retaining the full approval payload.

Deferred (consistency, not a discard-specific bug). ApprovalRecord.request is non-optional and Approved/Denied records already retain the identical full payload on disk — the Discarded tombstone is consistent with every other terminal state. Singling out discard for minimization would be the inconsistency. If approval-payload retention matters it's a cross-cutting policy (pruning/redaction of all terminal approval records) plus a schema change (optional/redacted request), which is out of scope for this PR. Tracking as a retention follow-up.

henrypark133 · 2026-06-26T03:26:02Z

+                cas_retry_backoff(attempt).await;
+            }
+            // 5b. Backend cannot CAS-write — fail closed (no blind overwrite).
+            Err(FilesystemError::Unsupported {


Medium — Missing test for op-time unsupported-write fail-closed path.

The helper covers the pre-flight non-CAS rejection, but the second fail-closed path is untested: an unknown/default-capability backend can read successfully and then return FilesystemError::Unsupported { operation: WriteFile } from put. That branch maps to CasUpdateError::CasUnsupported and should be protected because it is the composite-router fallback path.

Fix: Add tests::cas::unsupported_write_file_maps_to_cas_unsupported using a default-capabilities backend whose read succeeds and whose CAS write returns Unsupported(WriteFile).

Fixed in 578e971 — added unsupported_write_file_maps_to_cas_unsupported: a default/unknown-capability backend whose get succeeds and whose put returns Unsupported{WriteFile}, asserting cas_update maps it to CasUpdateError::CasUnsupported. This exercises the op-time fallback path, distinct from the already-tested pre-flight rejection.

henrypark133 · 2026-06-26T03:26:02Z

+            snapshot,
+            outcome,
+            write,
+        } = apply(current.clone())


Medium — CAS helper clones the full snapshot on every retry.

cas_update clones the decoded snapshot before every apply, and several migrated call sites clone again when building CasApply::new(...). For large snapshots such as turn state, each write now copies the whole record at least twice, and contention multiplies that cost across the retry loop. That is avoidable churn on the hot persistence path this PR is trying to improve.

Fix: Pass apply a borrowed snapshot or split the helper API so the helper retains one owned copy for equality/write checks while call sites clone only when they truly need ownership.

Deferred (low-value perf). Confirmed two in-memory clones per attempt: cas_update clones current (needed — retained for the equality check after apply consumes its copy) and the turns call site clones new_snapshot (threaded into outcome to populate the read cache). Both are struct copies, not IO, and the loop only re-clones on VersionMismatch (rare). Eliminating them requires an API change (return the final snapshot from cas_update, or hash-based equality) for marginal gain — premature without a profile showing turn-state writes are hot. Will revisit with a benchmark.

henrypark133 · 2026-06-26T03:26:02Z

+        // there is no lock contention in practice.
+        let apply = Arc::new(Mutex::new(apply));
+
+        let cas_future = cas_update(


Medium — Turn-state apply has become a 1k-line orchestration knot.

This refactor pushes FilesystemTurnStateStore from 961 to 1051 lines, and the new apply flow now mixes cache invalidation, runner-lease overlay, transient store construction, FnMut adaptation, CAS no-op detection, timeout policy, and cache update/rollback in one method. That is a lot of state to hold in one control flow, especially in the turn-state persistence path.

Fix: Split the CAS-specific orchestration into smaller helpers, or move runner-lease overlay/cache bookkeeping behind a focused helper so apply coordinates the transition at a higher level.

Agreed it's doing a lot, deferred to #5274. The clean move is extracting the per-retry transform (overlay + store build + no-op detection) into a named helper so apply reads as orchestration. Follow-up #5274 (migrate the runner-lease sidecar onto cas_update) already rewrites this exact apply/CAS path, so the decomposition lands there — once — rather than re-churning a just-merged file twice. Added to #5274's scope. (Note: most of the 961→1051 growth is main's own module split landing via the merge; net the file is smaller than this branch's pre-merge 1455.)

henrypark133 · 2026-06-26T03:26:02Z

+/// unrelated callers behind one wedged operation.
+pub const FILESYSTEM_CAS_RETRIES: usize = 32;
+/// Deadline for the entire read-modify-write loop (including all retries).
+pub const FILESYSTEM_APPLY_TIMEOUT: Duration = Duration::from_secs(15);


Medium — The apply timeout is now defined in two places.

The turn store keeps its own 15-second FILESYSTEM_APPLY_TIMEOUT, while the shared CAS helper now owns the same 15-second loop timeout. The outer timeout comment says this may be shorter for tests, but the default value is still duplicated, so a future timeout-policy change can drift the turn store away from the helper contract.

Fix: Use the shared ironclaw_filesystem::FILESYSTEM_APPLY_TIMEOUT as the turn-store default, or remove the outer default when the helper deadline is the only intended policy.

Fixed in 578e971 — the turn store now uses the shared ironclaw_filesystem::FILESYSTEM_APPLY_TIMEOUT instead of a duplicated local constant; the per-store override via with_apply_timeout (used in tests) is unchanged.

- cas/tests.rs: add `unsupported_write_file_maps_to_cas_unsupported` covering the op-time fail-closed path (default-capability backend reads OK, then `put` returns Unsupported{WriteFile} → CasUpdateError::CasUnsupported) — previously only the pre-flight non-CAS rejection was tested. - run_state tests: replace the sequential discard regression with a deterministic TOCTOU harness (RootFilesystem decorator injects a racing approve() during discard's read→write window) so the CAS-retry-after-version- bump path is actually exercised; discard loses the race and returns ApprovalNotPending without clobbering the resolved record. - turns/filesystem_store.rs: use the shared ironclaw_filesystem:: FILESYSTEM_APPLY_TIMEOUT instead of a duplicated local 15s constant. - docs/plans: quality gate uses check-mode `cargo fmt --all -- --check`. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

docs/plans/2026-06-25-cas-migration.md (1)
141-146: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Add cargo build to the quality gate.

Line 141 updates the checklist, but the repo’s required pre-PR validation still includes cargo build. Omitting it leaves this migration plan weaker than the documented submission gate.

As per coding guidelines, "Before You Open a PR" requires cargo build.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/plans/2026-06-25-cas-migration.md` around lines 141 - 146, The quality
gate checklist is missing the required build validation, so update the plan’s
validation list to include cargo build alongside the existing cargo
fmt/check/clippy/test commands. Make the change in the checklist section that
enumerates the verification steps so the migration plan matches the repo’s
pre-PR submission gate.
Source: Coding guidelines

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/ironclaw_filesystem/src/cas/tests.rs`:
- Around line 511-582: The test only checks the საბოლოary
CasUpdateError::CasUnsupported, so it can still pass even if cas_update fails
earlier in the pre-flight capability check and never reaches
UnsupportedWriteBackend::put. Update the UnsupportedWriteBackend test helper to
record when put is invoked, and in
unsupported_write_file_maps_to_cas_unsupported assert both that cas_update
returns CasUpdateError::CasUnsupported and that put was actually called, so the
op-time fallback path is proven.

---

Outside diff comments:
In `@docs/plans/2026-06-25-cas-migration.md`:
- Around line 141-146: The quality gate checklist is missing the required build
validation, so update the plan’s validation list to include cargo build
alongside the existing cargo fmt/check/clippy/test commands. Make the change in
the checklist section that enumerates the verification steps so the migration
plan matches the repo’s pre-PR submission gate.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 947d1e05-ffb2-4cfc-ac10-331d3eba110a

📥 Commits

Reviewing files that changed from the base of the PR and between 2097c6d and 578e971.

📒 Files selected for processing (5)

crates/ironclaw_filesystem/src/cas/tests.rs
crates/ironclaw_product_workflow/src/approval_interaction/service.rs
crates/ironclaw_run_state/tests/run_state_contract.rs
crates/ironclaw_turns/src/filesystem_store.rs
docs/plans/2026-06-25-cas-migration.md

henrypark133

Code Review (multi-agent)

Intent: Replace per-record filesystem mutex convoys with a shared CAS update helper and migrate all Reborn stores to it.

Stats: 4 findings kept (from 5 raw, 4 after filtering/dedup) across 2 files. Reviewers run: security, bugs, performance, tests, conventions, local-patterns, maintainability, approach. Reviewers failed: none. Body-only: 0.

All surviving findings are Medium/Low, so this is a comment review rather than a request for changes. I did not repost prior threads that already have fixed replies or sound deferrals; the turn-state apply-structure concern is covered by the existing #5274 deferral.

Tests

Medium cas_update lacks a generic backend-error regression test (crates/ironclaw_filesystem/src/cas.rs:330-341, confidence 78) — anchor: crates/ironclaw_filesystem/src/cas.rs:341
The helper has coverage for first write, no-op, contention, timeout, and unsupported-write, but not the fallback Err(error) => CasUpdateError::Backend(error) branch on a normal write failure. A backend returning a non-Unsupported FilesystemError from put would still be unpinned here.
Fix: Add tests::cas::backend_put_error_maps_to_backend covering a successful read followed by a generic backend error from put.
Low Discarded approval status is untested (crates/ironclaw_capabilities/src/helpers.rs:273-279, confidence 84) — anchor: crates/ironclaw_capabilities/src/helpers.rs:273
The new ApprovalStatus::Discarded arm has no direct regression test. Nothing asserts that discarded approvals map to ApprovalDiscarded, so this one-line branch could drift silently.
Fix: Add tests::helpers::approval_not_approved_error_kind_maps_discarded_status covering ApprovalStatus::Discarded -> "ApprovalDiscarded".

Performance

Low Last retry still pays a backoff sleep (crates/ironclaw_filesystem/src/cas.rs:333-334, confidence 89) — anchor: crates/ironclaw_filesystem/src/cas.rs:333
The loop always awaits cas_retry_backoff(attempt) on VersionMismatch, including the final permitted attempt. That adds an avoidable 2-50ms of tail latency to every exhausted CAS call on the shared hot path, with no chance of success because no retry remains.
Fix: Skip the sleep when attempt + 1 == FILESYSTEM_CAS_RETRIES and return RetriesExhausted immediately.

Local Patterns

Low Fix the equality-path example in the public API docs (crates/ironclaw_filesystem/src/cas.rs:105-108, confidence 91) — anchor: crates/ironclaw_filesystem/src/cas.rs:105
The public CasApply docs describe the unchanged-snapshot fast path as Some(existing) == snapshot, but snapshot is an S while current is the Option<S>. As written, the example is type-mismatched and makes the two no-op modes harder to distinguish for store authors reading the API surface.
Fix: Rewrite the sentence in terms of the actual comparison, for example current is Some(existing) and the returned snapshot equals existing.

henrypark133 · 2026-06-26T04:13:54Z

+                operation: FilesystemOperation::WriteFile,
+                ..
+            }) => return Err(CasUpdateError::CasUnsupported),
+            Err(error) => return Err(CasUpdateError::Backend(error)),


Medium — cas_update lacks a generic backend-error regression test.

The helper has coverage for first write, no-op, contention, timeout, and unsupported-write, but not the fallback Err(error) => CasUpdateError::Backend(error) branch on a normal write failure. A backend returning a non-Unsupported FilesystemError from put would still be unpinned here.

Fix: Add tests::cas::backend_put_error_maps_to_backend covering a successful read followed by a generic backend error from put.

Fixed in 4b061dd — added backend_put_error_maps_to_backend: a CAS-capable backend whose get succeeds and whose put returns a generic (non-VersionMismatch, non-Unsupported{WriteFile}) FilesystemError, asserting cas_update maps it to CasUpdateError::Backend(_). Covers the catch-all arm at cas.rs:341.

henrypark133 · 2026-06-26T04:13:54Z

+            Ok(_) => return Ok(outcome),
+            // 5a. Lost the CAS race — re-read and retry with backoff.
+            Err(FilesystemError::VersionMismatch { .. }) => {
+                cas_retry_backoff(attempt).await;


Low — Last retry still pays a backoff sleep.

The loop always awaits cas_retry_backoff(attempt) on VersionMismatch, including the final permitted attempt. That adds an avoidable 2-50ms of tail latency to every exhausted CAS call on the shared hot path, with no chance of success because no retry remains.

Fix: Skip the sleep when attempt + 1 == FILESYSTEM_CAS_RETRIES and return RetriesExhausted immediately.

Fixed in 4b061dd — the VersionMismatch arm now guards the sleep with if attempt + 1 < FILESYSTEM_CAS_RETRIES, so the final exhausted attempt returns RetriesExhausted immediately without the 2-50ms backoff. Retry count unchanged.

henrypark133 · 2026-06-26T04:13:54Z

        ApprovalStatus::Approved => "ApprovalApproved",
        ApprovalStatus::Denied => "ApprovalDenied",
        ApprovalStatus::Expired => "ApprovalExpired",
+        ApprovalStatus::Discarded => "ApprovalDiscarded",


Low — Discarded approval status is untested.

The new ApprovalStatus::Discarded arm has no direct regression test. Nothing asserts that discarded approvals map to ApprovalDiscarded, so this one-line branch could drift silently.

Fix: Add tests::helpers::approval_not_approved_error_kind_maps_discarded_status covering ApprovalStatus::Discarded -> "ApprovalDiscarded".

Fixed in 4b061dd — added approval_not_approved_error_kind_maps_discarded_status asserting ApprovalStatus::Discarded -> "ApprovalDiscarded".

henrypark133 · 2026-06-26T04:13:54Z

+/// Two no-op signals are available, covering the two cases where `apply` should
+/// skip the write entirely:
+///
+/// 1. **Existing record, unchanged snapshot** — return `CasApply::new` with


Low — Fix the equality-path example in the public API docs.

The public CasApply docs describe the unchanged-snapshot fast path as Some(existing) == snapshot, but snapshot is an S while current is the Option<S>. As written, the example is type-mismatched and makes the two no-op modes harder to distinguish for store authors reading the API surface.

Fix: Rewrite the sentence in terms of the actual comparison, for example current is Some(existing) and the returned snapshot equals existing.

Fixed in 4b061dd — the CasApply doc example no longer mismatches types; it now describes the actual check (current is Some(existing) and the returned snapshot equals existing), matching matches!(&current, Some(e) if *e == snapshot) and distinguishing it from the explicit CasApply::no_op path.

- cas/tests.rs: assert `UnsupportedWriteBackend::put` was actually invoked in `unsupported_write_file_maps_to_cas_unsupported` (proves the op-time fallback path, not just the final enum which pre-flight also yields). - cas/tests.rs: add `backend_put_error_maps_to_backend` covering the generic `put` failure → `CasUpdateError::Backend` catch-all arm. - cas.rs: skip the jittered backoff sleep on the final CAS attempt (`attempt + 1 < FILESYSTEM_CAS_RETRIES`) — no retry remains, so it was pure tail latency on the shared hot path. - cas.rs: fix the type-mismatched equality example in the `CasApply` docs (`current` is `Some(existing)` and the returned snapshot equals `existing`). - capabilities/helpers.rs: regression test for `ApprovalStatus::Discarded -> "ApprovalDiscarded"`. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

henrypark133

Code Review (multi-agent)

Intent: Remove per-record filesystem mutex convoys by migrating Reborn filesystem-backed stores to a shared CAS update helper.

Stats: 4 findings (from 8 reviewer lanes; 1 High, 3 Medium) across 1 file. Reviewers run: security, bugs, performance, tests, conventions, local-patterns, maintainability, approach. Reviewers failed: none. Body-only: 0.

Prior-thread context was considered. I did not repost the already-addressed items or the soundly deferred follow-ups around Send bound creep, BackendCapabilities::default() as an explicit-design cleanup, approval retention/tombstone policy, clone-cost benchmarking, or turn-state decomposition in #5274.

Bugs

High Byte-only backends still slip through on first write (crates/ironclaw_filesystem/src/cas.rs:256-347, confidence 88) — anchor: crates/ironclaw_filesystem/src/cas.rs:260
The fail-closed CAS gate can bypass default-capability composite roots and then successfully first-write JSON byte snapshots on local/byte-only mounts via CasExpectation::Absent.

Tests

Medium Missing test for backend read failure propagation (crates/ironclaw_filesystem/src/cas.rs:292-298, confidence 93) — anchor: crates/ironclaw_filesystem/src/cas.rs:298
The helper's get error branch should be pinned as CasUpdateError::Backend.
Medium Missing test for malformed snapshot decode failure (crates/ironclaw_filesystem/src/cas.rs:293-295, confidence 98) — anchor: crates/ironclaw_filesystem/src/cas.rs:294
Malformed persisted bytes should be pinned as CasUpdateError::Apply at the helper boundary.
Medium Missing test for encode failure propagation (crates/ironclaw_filesystem/src/cas.rs:325-327, confidence 98) — anchor: crates/ironclaw_filesystem/src/cas.rs:327
Encoder failures should be pinned as CasUpdateError::Apply without writing.

henrypark133 · 2026-06-26T04:40:28Z

+    // *known* shape; the composite router advertises the empty default and
+    // defers to the op-time check below. Either way we never fall back to
+    // `CasExpectation::Any`.
+    let capabilities = filesystem.capabilities();


High — Byte-only backends still slip through on first write.

The fail-closed gate only rejects backends that advertise a non-default capability shape or that return Unsupported(WriteFile) from put. A composite filesystem reports BackendCapabilities::default() before path resolution, then routes the write to the mounted backend. LocalFilesystem accepts CasExpectation::Absent for plain byte entries, and CAS snapshots are JSON byte entries, so the first snapshot write on a local/byte-only mount can succeed instead of surfacing CasUnsupported. Later Version(...) writes fail, but the helper has already violated its no-CAS fail-closed contract for fresh records.

Fix: Do not infer CAS support from the root composite capabilities alone. Gate against path-resolved mount capabilities, or reject unknown/default roots before issuing any write so Absent writes cannot succeed on non-CAS mounts.

Fixed in 0d4bc1c — made all CAS-store encoders record-shaped (entry.kind = Some(RecordKind)), so LocalFilesystem's existing record-shaped rejection fires fail-closed uniformly on the Absent first-write too (turns was already protected; run_state/secrets/resources/threads were not). Self-enforcing at the data-model level, no new helper branches. Severity is really Medium — Absent is create-if-absent and all updates fail-closed, so there is no blind overwrite / lost update; only creation succeeded then updates failed loudly. New record_shaped_first_write_fails_closed_on_byte_only_backend + a store-level fail-closed test over a real byte-only LocalFilesystem mount; verified across the postgres/libsql/all-features matrix.

henrypark133 · 2026-06-26T04:40:28Z

+                (Some(decoded), Some(versioned.version))
+            }
+            Ok(None) => (None, None),
+            Err(error) => return Err(CasUpdateError::Backend(error)),


Medium — Missing test for backend read failure propagation.

The filesystem.get(...).await error arm now returns CasUpdateError::Backend, but the new CAS helper tests never inject a read failure. The suite covers contention, capability gating, timeout, and put failures, but not the branch that should surface a backend failure before a snapshot can be loaded.

Fix: Add tests::cas::get_error_is_returned_as_backend using a backend whose get returns a non-NotFound FilesystemError and assert cas_update returns CasUpdateError::Backend(_).

Fixed in 0d4bc1c — added backend_get_error_maps_to_backend pinning a get failure to CasUpdateError::Backend.

henrypark133 · 2026-06-26T04:40:28Z

+        //    committed version, never a torn one.
+        let (current, version) = match filesystem.get(scope, path).await {
+            Ok(Some(versioned)) => {
+                let decoded = decode(&versioned.entry.body).map_err(CasUpdateError::Apply)?;


Medium — Missing test for malformed snapshot decode failure.

The decode(&versioned.entry.body) failure path maps to CasUpdateError::Apply, but no test drives cas_update with invalid bytes in an existing record. That helper boundary should be pinned directly because malformed persisted snapshots are a distinct failure mode from apply-closure errors.

Fix: Add tests::cas::decode_error_is_returned_as_apply with an existing record containing malformed JSON and assert the decode error is returned through CasUpdateError::Apply.

Fixed in 0d4bc1c — added malformed_snapshot_decode_maps_to_apply pinning a decode failure to CasUpdateError::Apply.

henrypark133 · 2026-06-26T04:40:28Z

+
+        // 4. Encode + CAS put. Absent → create-if-absent; existing → version
+        //    precondition.
+        let entry = encode(&snapshot).map_err(CasUpdateError::Apply)?;


Medium — Missing test for encode failure propagation.

The encode(&snapshot) failure path also maps to CasUpdateError::Apply, but the current tests only use an encoder that succeeds. A serialization failure would therefore be unpinned at the shared helper boundary.

Fix: Add tests::cas::encode_error_is_returned_as_apply using an encoder that returns TestError after apply produces a changed snapshot, and assert CasUpdateError::Apply is returned without writing.

Fixed in 0d4bc1c — added encode_failure_maps_to_apply_without_write pinning an encode failure to CasUpdateError::Apply and asserting no put was attempted.

… tests Multi-agent review (High): cas_update's fail-closed guarantee had a gap on the first (absent) write. A default/unknown-capability backend skips the pre-flight gate, and LocalFilesystem.put rejects only record-shaped entries (entry.kind.is_some()) — so a BYTE-ONLY entry written with CasExpectation::Absent was accepted, slipping past the op-time fail-closed check. ironclaw_turns already encoded record-shaped entries (kind set) and was protected; run_state, secrets, resources, and threads encoded byte-only entries and were not. Fix (make the contract self-enforcing at the data-model level, not mount-config dependent): set entry.kind on every CAS-store encoder so they are record-shaped like turns. LocalFilesystem's existing record-shaped rejection then makes the op-time fail-closed fire uniformly on the Absent path too — no new branches in the shared helper. Reads are unaffected (body decode ignores kind); additive vs existing kind=None rows. - run_state: run-record + approval-record encoders set kind; store-level fail-closed regression test over a byte-only LocalFilesystem mount. - secrets: secret / lease / account / session encoders set kind. - resources: Snapshot trait gains RECORD_KIND; cas_snapshot encoder stamps it; ResourceGovernor + BudgetGate snapshots declare their kind. - threads: session / message / summary / idempotency encoders set kind. - cas/tests.rs: `record_shaped_first_write_fails_closed_on_byte_only_backend` proves the mechanism end-to-end (kind-gated put → CasUnsupported on first write), plus error-propagation tests: get-failure → Backend, decode-failure → Apply, encode-failure → Apply (no write). Note (kind identifiers): RecordKind validates as `[A-Za-z_][A-Za-z0-9_]*`, so snake_case kinds (matching turns' `turn_state_snapshot`), not kebab-case. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

henrypark133 · 2026-06-26T05:06:06Z

Addressed the latest multi-agent review (1 High + 3 Medium tests) in 0d4bc1c.

High — byte-only backends slip through on first write. Valid, and a touch broader than flagged: LocalFilesystem.put rejects only record-shaped entries (entry.kind.is_some()) before the CAS check, so a byte-only entry written with CasExpectation::Absent was accepted — slipping past op-time fail-closed on the first write. ironclaw_turns already encoded record-shaped (kind set) and was protected; run_state/secrets/resources/threads encoded byte-only entries and were not.
Severity is really Medium, not data-integrity: Absent is create-if-absent (concurrent creates → one winner + VersionMismatch for the rest → retry hits the Version path → fail-closed), and all updates fail-closed, so there is no blind overwrite / lost update — the gap was only that creation succeeded, then every update failed loudly.
Fix (self-enforcing at the data-model level rather than mount-config dependent): set entry.kind on every CAS-store encoder so they are record-shaped like turns. LocalFilesystem’s existing record-shaped rejection then fires uniformly on the Absent path too — no new branches in the shared helper, and it removes the turns-vs-rest inconsistency. Additive/back-compatible (reads ignore kind; old kind=None rows still decode). New record_shaped_first_write_fails_closed_on_byte_only_backend proves the mechanism end-to-end; a store-level fail-closed test over a real byte-only LocalFilesystem mount was added in run_state. Verified across the postgres/libsql/all-features matrix.

Medium ×3 — missing error-path tests. Added: read failure → CasUpdateError::Backend (backend_get_error_maps_to_backend); malformed decode → Apply (malformed_snapshot_decode_maps_to_apply); encode failure → Apply with no write attempted (encode_failure_maps_to_apply_without_write).

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

crates/ironclaw_threads/src/filesystem_service.rs (1)
1118-1125: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Validate the stored thread ID before returning an existing record.

Line 1120 only scope-checks existing; if the path contains a same-scope record for a different thread_id, ensure_thread returns the wrong thread as successfully ensured. Match read_thread_versioned’s identity check and reject mismatched IDs.
Proposed fix
-                        if existing.record.scope != scope {
+                        if existing.record.scope != scope || existing.record.thread_id != thread_id {
                             Err(SessionThreadError::ThreadScopeMismatch { thread_id })
                         } else {
As per coding guidelines, “Preserve tenant/user/agent/project/mission/thread scope on authority, state, memory, process, network, outbound, resource, and event records.”
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/ironclaw_threads/src/filesystem_service.rs` around lines 1118 - 1125,
The existing-return path in ensure_thread only checks scope and can incorrectly
accept a record whose stored thread_id does not match the requested thread_id.
Update the Some(existing) branch in filesystem_service::ensure_thread to also
validate existing.record.thread_id against the requested thread_id, matching the
identity check used by read_thread_versioned, and return a thread-mismatch error
when the IDs differ before returning CasApply::new(existing.clone(),
existing.record).
Source: Coding guidelines

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/ironclaw_filesystem/src/cas/tests.rs`:
- Around line 975-995: The byte-only mock in KindGatedByteOnlyBackend::put only
verifies the entry shape and ignores the cas argument, so the test still passes
even if cas_update stops passing CasExpectation::Absent for first writes. Update
the mock to capture and assert the CAS precondition alongside path and entry,
and ensure the test exercises the absent-write case through cas_update so every
argument the production caller passes is validated.

---

Outside diff comments:
In `@crates/ironclaw_threads/src/filesystem_service.rs`:
- Around line 1118-1125: The existing-return path in ensure_thread only checks
scope and can incorrectly accept a record whose stored thread_id does not match
the requested thread_id. Update the Some(existing) branch in
filesystem_service::ensure_thread to also validate existing.record.thread_id
against the requested thread_id, matching the identity check used by
read_thread_versioned, and return a thread-mismatch error when the IDs differ
before returning CasApply::new(existing.clone(), existing.record).

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 7c6c776b-3c08-4c47-8ace-e275948a1864

📥 Commits

Reviewing files that changed from the base of the PR and between 4b061dd and 0d4bc1c.

📒 Files selected for processing (8)

crates/ironclaw_filesystem/src/cas/tests.rs
crates/ironclaw_resources/src/cas_snapshot.rs
crates/ironclaw_resources/src/filesystem_store.rs
crates/ironclaw_resources/src/lib.rs
crates/ironclaw_run_state/src/lib.rs
crates/ironclaw_run_state/tests/run_state_contract.rs
crates/ironclaw_secrets/src/filesystem_store.rs
crates/ironclaw_threads/src/filesystem_service.rs

- cas/tests.rs: the byte-only mock's `put` now captures `cas` and asserts `CasExpectation::Absent`, so the fail-closed regression also pins that cas_update uses Absent for the first (absent) write — not just the record-shaped op-time rejection. - threads/filesystem_service.rs: ensure_thread's existing-return branch now rejects a thread_id mismatch as well as a scope mismatch, matching read_thread_versioned (line 236) and the message-read guard. The thread path is thread_id-keyed so this is defense-in-depth parity rather than a reachable bug via the public API; bringing it in line with the canonical read path removes the inconsistency a reviewer flagged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

henrypark133 · 2026-06-26T05:56:56Z

Re the outside-diff Major on ironclaw_threads/filesystem_service.rs ensure_thread (validate stored thread ID): addressed in 878a44c. The Some(existing) branch now rejects a thread_id mismatch as well as a scope mismatch, matching read_thread_versioned (line 236) and the message-read guard (line 407).
Worth noting it is defense-in-depth, not a reachable bug today: thread_record_path is {scope}/{thread_id}/thread.json — thread_id is a literal path segment, so a CAS read at that path can only return a record written there under the same thread_id. The change removes the inconsistency with the canonical read paths rather than fixing a live mis-identification, so no fabricated regression test was added (the mismatch is unreachable via the public API).

henrypark133

Code Review (multi-agent)

Intent: Remove per-record filesystem lock convoys by migrating Reborn persistence stores to a shared CAS update helper.

Stats: 2 findings kept (from 6 raw final-head reviewer findings, 2 after live-thread/stale filtering) across 2 files. Reviewers run: security, bugs, performance, tests, conventions, local-patterns, maintainability, approach. Reviewers failed: none. Body-only: 1.

Prior-thread context was considered. I filtered out the repeated CAS helper findings for byte-only backends and get/decode/encode error-path tests because the live threads already mark them fixed on the current series and the final head contains the corresponding tests.

Tests

Medium Cover consuming an already-expired lease (crates/ironclaw_secrets/src/filesystem_store.rs:498-504, confidence 86) — anchor: crates/ironclaw_secrets/src/filesystem_store.rs:498
The new SecretLeaseStatus::Expired branch in FilesystemSecretStore::consume returns LeaseExpired directly when the lease record is already marked Expired, but there is no regression test for that already-expired path. A zero-TTL or pre-aged lease could start rewriting state or surfacing the wrong error without a failing test.

Local Patterns

Low Drop the stale per-path lock description from the CAS snapshot docs (crates/ironclaw_resources/src/cas_snapshot.rs:8-10, confidence 95) — anchor: crates/ironclaw_resources/src/cas_snapshot.rs:8 (no diff position — body only)
The module docs still say the shared snapshot stores use an in-process per-path async lock that serializes same-process writers, but this module now routes updates through CasSnapshotStore and the shared lock-free cas_update helper. That wording is now inaccurate and can mislead future maintainers about the actual concurrency model, especially when adding new stores.

henrypark133 · 2026-06-26T06:12:58Z

+                        SecretLeaseStatus::Revoked => {
+                            Err(SecretStoreError::LeaseRevoked { lease_id })
+                        }
+                        SecretLeaseStatus::Expired => {


Medium — Cover consuming an already-expired lease.

The new SecretLeaseStatus::Expired branch in FilesystemSecretStore::consume returns LeaseExpired directly when the lease record is already marked Expired, but there is no regression test for that already-expired path. A zero-TTL or pre-aged lease could start rewriting state or surfacing the wrong error without a failing test.

Fix: Add a filesystem secret-store test that creates or marks an already-expired lease, calls consume, and asserts LeaseExpired is returned without issuing another write.

Fixed in 805bd15 — added consume_already_expired_lease_returns_expired_without_rewrite. It forces a genuine stored Expired state (lease TTL = -1s → first consume promotes Active→Expired with one write → second consume hits the already_marked branch), asserts Err(SecretStoreError::LeaseExpired), and captures the InMemoryBackend record version before/after to prove the already-marked path issues NO second write (version unchanged).

… LocalFilesystem CI regression from making the CAS-store encoders record-shaped (0d4bc1c): LocalFilesystem.put rejects record-shaped entries (entry.kind.is_some()) with Unsupported{WriteFile}, which cas_update maps to fail-closed CasUnsupported. Several reborn test harnesses backed run-state / approval / session-thread stores with a byte-only LocalFilesystem — a backend production never uses for these stores (the filesystem guardrail states all production store mounts resolve to CAS-capable db/in-memory backends; LocalFilesystem is structurally unreachable). So the failures are the intended fail-closed contract correctly rejecting a non-production backend, not a code regression. Fix the harnesses, not the code: route the CAS stores through a shared Arc<InMemoryBackend> (full BackendCapabilities, CAS-capable) which models the production database-backed filesystem. The durable-restart test keeps its three service-graph instances over one shared backend, preserving the state-survives-restart semantics; LocalFilesystem still backs the byte-only capability-lease / process services (which don't set RecordKind). No assertions weakened, no tests ignored. - crates/ironclaw_host_runtime/tests/reborn_durable_restart_integration.rs - tests/support/reborn/session_thread.rs - tests/support/reborn/harness.rs Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…stale CAS doc - secrets: add `consume_already_expired_lease_returns_expired_without_rewrite`. The `SecretLeaseStatus::Expired` already-marked branch of `consume` returns LeaseExpired via the apply-error path with no write; the test forces a stored Expired state (TTL=-1 → first consume promotes to Expired → second consume hits the already-marked path) and asserts the backend record version does NOT bump (no second write). - resources/cas_snapshot.rs: module doc no longer describes an in-process per-path async lock — it now routes through the lock-free shared cas_update helper (versioned CAS + retry on VersionMismatch + jittered backoff). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

henrypark133 · 2026-06-26T16:36:26Z

Re the Low (body-only) finding on ironclaw_resources/src/cas_snapshot.rs module docs (stale per-path lock description): fixed in 805bd15. The header no longer mentions an in-process per-path async lock serializing same-process writers — it now describes the actual model: lock-free optimistic concurrency via the shared cas_update helper (versioned compare-and-swap, retry on VersionMismatch with bounded jittered backoff). Verified no other stale lock/serializ mentions remain in the module docs (the remaining hits are accurate method/test notes).

henrypark133 and others added 7 commits June 24, 2026 19:01

railway-app Bot temporarily deployed to ironclaw-ci-preview / ironclaw-pr-5234 June 25, 2026 11:53 Destroyed

github-actions Bot added scope: docs Documentation scope: dependencies Dependency updates size: XL 500+ changed lines risk: low Changes to docs, tests, or low-risk modules contributor: core 20+ merged PRs labels Jun 25, 2026

gemini-code-assist Bot reviewed Jun 25, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed Jun 25, 2026

View reviewed changes

coderabbitai Bot reviewed Jun 25, 2026

View reviewed changes

henrypark133 mentioned this pull request Jun 25, 2026

docs(reborn): design — write-behind lease durability + side-effect gate #5249

Open

henrypark133 mentioned this pull request Jun 25, 2026

Heartbeat lease write-behind: move runner-lease renewal off the synchronous Postgres path #5253

Open

railway-app Bot had a problem deploying to ironclaw-ci-preview / ironclaw-pr-5234 June 25, 2026 22:24 Failure

coderabbitai Bot reviewed Jun 25, 2026

View reviewed changes

Comment thread crates/ironclaw_filesystem/src/cas.rs

Comment thread crates/ironclaw_run_state/tests/run_state_contract.rs

Comment thread docs/plans/2026-06-25-cas-migration.md Outdated

henrypark133 mentioned this pull request Jun 25, 2026

Migrate ironclaw_turns runner-lease sidecar CAS loops onto shared cas_update #5274

Open

railway-app Bot temporarily deployed to ironclaw-ci-preview / ironclaw-pr-5234 June 25, 2026 23:30 Destroyed

coderabbitai Bot reviewed Jun 25, 2026

View reviewed changes

henrypark133 commented Jun 26, 2026

View reviewed changes

railway-app Bot temporarily deployed to ironclaw-ci-preview / ironclaw-pr-5234 June 26, 2026 03:42 Destroyed

coderabbitai Bot reviewed Jun 26, 2026

View reviewed changes

Comment thread crates/ironclaw_filesystem/src/cas/tests.rs Outdated

henrypark133 commented Jun 26, 2026

View reviewed changes

railway-app Bot temporarily deployed to ironclaw-ci-preview / ironclaw-pr-5234 June 26, 2026 04:21 Destroyed

henrypark133 commented Jun 26, 2026

View reviewed changes

railway-app Bot temporarily deployed to ironclaw-ci-preview / ironclaw-pr-5234 June 26, 2026 05:05 Destroyed

coderabbitai Bot reviewed Jun 26, 2026

View reviewed changes

Comment thread crates/ironclaw_filesystem/src/cas/tests.rs

railway-app Bot temporarily deployed to ironclaw-ci-preview / ironclaw-pr-5234 June 26, 2026 05:56 Destroyed

henrypark133 commented Jun 26, 2026

View reviewed changes

railway-app Bot temporarily deployed to ironclaw-ci-preview / ironclaw-pr-5234 June 26, 2026 06:28 Destroyed

railway-app Bot temporarily deployed to ironclaw-ci-preview / ironclaw-pr-5234 June 26, 2026 16:36 Destroyed

Merge remote-tracking branch 'origin/main' into fix/reborn-p1-cas-helper

23be54b

railway-app Bot deployed to ironclaw-ci-preview / ironclaw-pr-5234 June 26, 2026 16:50 View deployment

Uh oh!

Conversation

henrypark133 commented Jun 25, 2026

Problem

Change

Tri-backend parity (no diverging experience by host)

Notes

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot commented Jun 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Estimated code review effort

Possibly related issues

Possibly related PRs

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

railway-app Bot commented Jun 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

henrypark133 commented Jun 25, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

henrypark133 left a comment

Choose a reason for hiding this comment

Code Review (multi-agent)

Security

Tests

Performance / Concurrency

Conventions

Maintainability

Uh oh!

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot commented Jun 25, 2026 •

edited

Loading

railway-app Bot commented Jun 25, 2026 •

edited

Loading